Importing necessary libraries and data

Data Overview

Observations

  1. brand_name, os, 4g and 5g are categorical variables.
  2. screen_size, main_camera_mp, selfie_camera_mp, int_memory, ram, battery, weight, release_year, days_used, new_price and used_price are numerical variables.

Observation

  1. There are 3571 rows and 15 columns

Observations

  1. It is observed that main_camera_mp, selfie_camera_mp, int_memory, ram, battery and weight columns have missing values.
  2. brand_name, os, 4g and 5g are object variables.

Summary of the Dataset

Observation

  1. There is a huge difference between the 3rd quartile and the maximum value for most of the columns indicating there might be outliers to the right of these values

By default, the describe() function shows the summary of numeric variables only. Let's check the summary of non-numeric variables.
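A minimal sketch of that non-numeric summary, assuming the data sits in a pandas DataFrame. The frame `df` below is a made-up toy stand-in, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the used-phone data (all values are made up).
df = pd.DataFrame({
    "brand_name": ["Samsung", "Apple", "Nokia", "Samsung"],
    "os": ["Android", "iOS", "Others", "Android"],
    "4g": ["yes", "yes", "no", "yes"],
    "5g": ["no", "no", "no", "yes"],
    "used_price": [90.5, 400.0, 20.0, 150.0],
})

# Excluding numeric dtypes leaves only the categorical columns; the summary
# then reports count, unique, top (most frequent value) and freq.
cat_summary = df.describe(exclude=["number"])
print(cat_summary)
```

The `unique` row of this summary is where counts such as the number of distinct brands come from.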

Observations

  1. brand_name has 34 unique values and the os column has 4.
  2. The 4g and 5g columns have 2 unique values each.

Let's check the count of each unique category in each of the categorical variables.
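One way to get those counts is `value_counts()` per column. This is a sketch on made-up values; only the column names come from the text.

```python
import pandas as pd

# Toy categorical columns (values are made up for illustration).
df = pd.DataFrame({
    "os": ["Android", "Android", "iOS", "Others", "Android"],
    "4g": ["yes", "yes", "yes", "no", "yes"],
    "5g": ["no", "no", "no", "no", "yes"],
})

# One frequency table per categorical column, most frequent category first.
counts = {col: df[col].value_counts() for col in ["os", "4g", "5g"]}
for col, table in counts.items():
    print(table, "\n")
```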

Observations

  1. Most of the phones in the data are Android based.
  2. 4g is far more common than 5g among the phones in the data.

Check for unique values in the column

Observations

  1. The screen_size of the used_phones varies from 2.7 to 46.36 cm. It is positively skewed.
  2. The resolution of main_camera_mp ranges from 0.08 to 48.0 megapixels. It is positively skewed.
  3. The resolution of selfie_camera_mp ranges from 0.005 to 1024 megapixels. It is positively skewed.
  4. There is a wide range of ram values for the phones, indicating the possibility of outliers.
  5. There is a wide range of int_memory values among the used_phones, from 0.005 to 1024GB, indicating the possibility of outliers.
  6. The used_price of the phones ranges from 2.50 euros to 1916.54 euros. It is positively skewed.
  7. The battery capacity of these used_phones ranges from 80mAh to 12000mAh. It is positively skewed.

Missing Values

Observations

  1. main_camera_mp has 180 missing values.
  2. int_memory and ram have 10 missing values each, which may indicate a pattern in the missingness.
  3. The battery column has 6 missing values and the weight column has 7 missing values.
  4. selfie_camera_mp has 2 missing values.
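The missing-value counts above can be produced with `isnull().sum()`. The sketch below runs on made-up data and also checks whether int_memory and ram are missing in the same rows, which is one way to probe the suspected pattern.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are made up for illustration).
df = pd.DataFrame({
    "main_camera_mp": [13.0, np.nan, 8.0, np.nan],
    "int_memory": [64.0, 32.0, np.nan, 16.0],
    "ram": [4.0, 4.0, np.nan, 3.0],
    "used_price": [90.5, 120.0, 60.0, 75.0],
})

# Number of missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)

# Rows where int_memory and ram are missing together.
both_missing = int((df["int_memory"].isnull() & df["ram"].isnull()).sum())
print("int_memory and ram missing in the same row:", both_missing)
```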

Exploratory Data Analysis (EDA)

Let's explore the numerical variables

Observations on screen_size

Observations

  1. The distribution of screen_size is positively skewed.
  2. The screen_size of most of the used_phones falls between 10 and 20 cm.
  3. The median is 14cm and the mean is 13cm.
  4. There are also observations where the screen size is more than 40cm.

Observations on main_camera_mp

Observations

  1. The resolution of main_camera_mp ranges from 0 to 48 megapixels.
  2. Most of the used_phones have ~13 megapixel resolution.

Observations on selfie_camera_mp

Observations

  1. The distribution is positively skewed.
  2. The most common selfie camera resolution among the used_phones is 5 megapixels.
  3. There are also some phones with 30 megapixels.

Observations on int_memory

Observations

  1. Most of the used_phones have int_memory between 0-1GB.
  2. The distribution shows 3 outliers.
  3. There are used_phones with 1024GB

Observations on ram

Observations:

  1. Most of the used_phones have 4GB RAM.
  2. Values other than 4GB are flagged as outliers.

Observations on Battery

Observations:

  1. The battery distribution is right skewed.
  2. There are many outliers.
  3. There are many used_phones with a ~3000mAh battery.

Observations on weight

Observations

  1. There are many phones whose weight is ~150 grams.
  2. There are also some heavy phones (~900 grams).
  3. There are some phones weighing ~20 grams.

Observations on release_year

Observations

  1. The data contains used phones released in the market between 2013 and 2020.
  2. The largest number of phones were released in 2014.

Observations on days_used

Observations

There are phones which have been used for 600-1000 days.

Observations on new_price

Observations

The distribution is right skewed. There are many phones in the ~100-150 Euros price range. There are also some high-priced phones.

Observations on used_price

Observations

  1. Maximum used_price is ~2000 Euros.
  2. There are many phones whose used price is below 200 Euros.
  3. The distribution is right skewed and there are many outliers.

Let us explore categorical variables

Observations on OS

Observations

  1. Many phones in the market are Android based.

Observations on 4g

Observations

Many phones have 4g

Observations on 5g

Observation

Most of the used_phones do not have 5g

Bivariate Analysis

Observations

  1. There is high correlation between used_price and new_price, battery and screen_size, weight and used_price, release_year and selfie_camera_mp.
  2. There is also high correlation between weight and battery.
  3. The least correlation is between main_camera_mp and screen_size, int_memory and screen_size, weight and main_camera_mp, selfie_camera_mp and weight.
  4. There is also low correlation between battery and int_memory, release_year and new_price.
  5. It is important to note that correlation does not imply causation.
  6. All are positively correlated.

Bivariate Scatter Plots

Relationship between release_year and used_price

Observation

  1. The latest phones have higher used_price

Relationship between used_price and days_used

Observations

Most of the used_phones fall below 250 Euros price range

Relationship between used_price and new_price

Observations

The used_price and new_price are linearly related.

Relationship between used_price and ram

Observations

Most of the used_phones have RAM between 4-12GB

Relationship between used_price and int_memory

Observations

int_memory and used_price are linearly related.

Relationship between used_price and weight

Observations

Most of the used_phones are below 200 grams and their price range is below 500 Euros.

Relationship between used_price and battery

Observations

Most of the used phones have 4000mAh battery. There are also phones with 10000mAh.

Relationship between used_price and main_camera_mp

Observations

main_camera_mp and used_price are linearly related. Most of the used phones have main cameras of 8-14 megapixels.

Relationship between used_price and selfie_camera_mp

Observation

selfie_camera_mp and used_price are linearly related.

Relationship between used_price and screen_size

Observations

The screen_size of the highest-priced phone is ~20cm. The screen_size of most of the used_phones is below 35 cm. There are also phones whose screen_size is 44cm.

Check os by brand_name

Observations

  1. Most of the phones are Android based. There are very few iOS based used phones in the market.

Check os by 4g

Observations

  1. Most of the Android phones have 4g
  2. Most of the phones with an Others os do not have 4g

Check os by 5g

Observations

  1. Very few Android based phones have 5g
  2. None of the Others, Windows and iOS based phones have 5g

Check 5g on brand_name

Observations

Most of the used_phones have no 5g

Check 4g on 5g

Observations

Very few phones have both 4g and 5g

Used_price vs os

Observations

  1. iOS based used_phones have high used_price.
  2. Others os based and Windows based used_phones have low used_price

Used_price vs brand_name

Observations

  1. The onePlusOne brand has the highest used_price, followed by Apple and then Google

Used_price vs 4g

Observations

  1. The 4g-enabled used_phones have a higher used_price

Used_price vs 5g

Observations

The 5g-enabled used_phones have a higher used_price.

Correlation used_price, brand_name and os

Observations

  1. Apple phones with the Others os have a higher price.
  2. The Android based oneplusone phones are highly priced.
  3. All the brands that are Windows based have a lower used_price.
  4. Apple is the only iOS based brand.

Correlation used_price, brand_name and 4g

Observation

  1. Among all the brands, the used_phones that have 4g have a higher used_price than the phones that do not have 4g

Correlation used_price, brand_name and 5g

Observations

  1. Most of the Huawei, LG, Meizu, Motorola, Nokia, oneplusone, Oppo, Samsung, Vivo, Xiaomi and ZTE phones that have 5g are highly priced

Correlation ram, brand_name and 4g

Observations

There seems to be not much relation between RAM and 4g, though among some brands the phones that have 4g have more RAM.

Correlation ram, brand_name and 5g

Among some brands, the phones that have 5g have more RAM.

Correlation ram, brand_name and os

Observations

Android phones seem to have more RAM, followed by Others and then Windows

Questions:

  1. What does the distribution of used phone prices look like?
  2. What percentage of the used phone market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?
  4. A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?
  6. Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the used phone price?

What does the distribution of used phone prices look like?

Observations

Maximum used_price is ~2000 Euros. There are many phones whose used price is below 200 Euros. The distribution is right skewed and there are many outliers.

What percentage of the used phone market is dominated by Android devices?

Observation

About 90.9% of the used_phones are Android devices
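A share like this can be read off `value_counts(normalize=True)`. The sketch below uses a made-up os column, so its percentage is illustrative only and is not the figure reported above.

```python
import pandas as pd

# Toy os column: 9 Android phones and 1 iOS phone (made-up data).
os_col = pd.Series(["Android"] * 9 + ["iOS"], name="os")

# normalize=True returns proportions instead of counts.
share = os_col.value_counts(normalize=True) * 100
android_pct = round(share["Android"], 1)
print(f"Android share: {android_pct}%")
```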

The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?

Observation

  1. RAM seems to be discrete in nature.
  2. Most of the phones have 4GB RAM.
  3. Vivo, Samsung, OnePlusOne, Lenovo, LG, Huawei and Honor brand phones have RAM of more than 5GB
  4. There seems to be very little relation or no relation between RAM and brand_name.
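A per-brand summary is one way to back up observations like these. The sketch below groups a made-up frame by brand_name; the brands and RAM values are illustrative only.

```python
import pandas as pd

# Toy data (made-up brands and RAM values in GB).
df = pd.DataFrame({
    "brand_name": ["Samsung", "Samsung", "Nokia", "Nokia", "Honor"],
    "ram": [4.0, 8.0, 2.0, 4.0, 12.0],
})

# Mean, median and maximum RAM per brand, highest mean first.
ram_by_brand = (
    df.groupby("brand_name")["ram"]
    .agg(["mean", "median", "max"])
    .sort_values("mean", ascending=False)
)
print(ram_by_brand)
```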

A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?

Observations

  1. Phones with large batteries tend to be heavier than phones with small batteries.
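The comparison can be sketched by filtering on the 4500 mAh threshold from the question and summarising weight for that subset. The data below is made up for illustration.

```python
import pandas as pd

# Toy data (battery in mAh, weight in grams; values are made up).
df = pd.DataFrame({
    "battery": [3000, 5000, 6000, 4000, 10000],
    "weight": [150, 200, 230, 160, 480],
})

# Phones offering large batteries, as defined in the question.
large = df[df["battery"] > 4500]
small = df[df["battery"] <= 4500]

weight_stats = large["weight"].describe()
print(weight_stats)
print("mean weight, large vs small battery:",
      large["weight"].mean(), small["weight"].mean())
```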

Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?

Observations

There are 3450 used_phones that have screen_size greater than 6 inches.
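Since the earlier observations give screen_size in cm, a 6 inch threshold needs converting to cm (6 × 2.54 = 15.24) before filtering. The sketch below assumes that unit and uses made-up brands and sizes, so its counts are illustrative only.

```python
import pandas as pd

INCH_TO_CM = 2.54

# Toy data; screen_size is assumed to be in cm (values are made up).
df = pd.DataFrame({
    "brand_name": ["A", "A", "B", "B", "C"],
    "screen_size": [12.7, 15.9, 16.5, 10.2, 25.4],
})

# Convert the 6 inch threshold to cm before comparing.
threshold_cm = 6 * INCH_TO_CM
large_screen = df[df["screen_size"] > threshold_cm]

# Number of large-screen phones per brand.
per_brand = large_screen["brand_name"].value_counts()
print(per_brand)
```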

Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?

Observations

  1. The graph is right skewed.
  2. There are many used_phones whose selfie_camera_mp value is 16 megapixels.
  3. There are also some phones with a 32 megapixel selfie_camera_mp.

Which attributes are highly correlated with the used phone price?

Observations

  1. new_price is highly correlated with used_price with 0.93 correlation
  2. ram and used_price have 0.52 correlation. selfie_camera_mp and used_price have 0.5 correlation
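Correlations with the target can be ranked by taking the used_price column of the correlation matrix. The sketch below uses made-up numbers, so its coefficients differ from the ones reported above.

```python
import pandas as pd

# Toy numeric data (values are made up for illustration).
df = pd.DataFrame({
    "new_price": [100, 200, 300, 400, 500],
    "days_used": [900, 700, 650, 400, 300],
    "ram": [2, 4, 4, 6, 8],
    "used_price": [40, 90, 120, 180, 240],
})

# Pearson correlation of every feature with used_price, strongest positive first.
corr_with_target = (
    df.corr()["used_price"].drop("used_price").sort_values(ascending=False)
)
print(corr_with_target)
```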

Data Preprocessing

Let's fix the missing values

For the target variable (used_price) we will drop the missing values. For the predictor variables, we will replace missing values in each column with its median.
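A minimal sketch of this treatment on a made-up frame: rows without the target are dropped, then each numeric predictor is filled with its own median. Only numeric predictors are shown.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are made up for illustration).
df = pd.DataFrame({
    "main_camera_mp": [13.0, np.nan, 8.0, 5.0],
    "weight": [150.0, 180.0, np.nan, 200.0],
    "used_price": [90.0, 120.0, 60.0, np.nan],
})

# Drop rows where the target is missing.
df = df.dropna(subset=["used_price"]).copy()

# Fill each predictor with its own column median (computed after the drop).
predictors = df.columns.drop("used_price")
df[predictors] = df[predictors].fillna(df[predictors].median())
print(df)
```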

Observations

Missing values have been treated.

Observations

  1. brand_name has 34 unique values.
  2. 4g and 5g have 2 unique values, yes or no.
  3. Released year of phones range between 2013 and 2020.
  4. os has 4 unique values, i.e., android, others, iOS and Windows.
  5. Average used_price is 109.88 Euros

Univariate Analysis

Exploring the dependent variable.

Observations

  1. used_price is right skewed; some phones are highly priced
  2. Mean used_price is around 110 Euros

Feature Engineering

The brand_name column has 34 unique values. Let's group the brand names based on new_price
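One possible way to build such a grouping is `pd.cut`. The edges 500, 1500 and 3000 come from the medium and high brand-type definitions given in the recommendations at the end. The lower edge of 0, the label names, and binning each row's own new_price (rather than, say, a per-brand average) are assumptions of this sketch.

```python
import pandas as pd

# Toy data (made-up brands and new prices in Euros).
df = pd.DataFrame({
    "brand_name": ["A", "B", "C", "D"],
    "new_price": [120.0, 480.0, 900.0, 2100.0],
})

# Assumed bins: (0, 500] low, (500, 1500] medium, (1500, 3000] high.
bins = [0, 500, 1500, 3000]
labels = ["low", "medium", "high"]
df["brand_type"] = pd.cut(df["new_price"], bins=bins, labels=labels)

type_counts = df["brand_type"].value_counts()
print(type_counts)
```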

Observations

  1. There seem to be more used_phones of the low brand_type (3325 used_phones)

Observations

  1. Most of the used_phones in the market are of the low brand_type

Outlier detection and treatment

Observations

  1. There are lower outliers in screen_size and weight.
  2. There are no outliers in days_used,release_year.
  3. The other numerical columns have upper outliers.
  4. We will treat these outliers as they might adversely affect the predictive power of the linear model.
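The text does not say which treatment was applied. One common choice, assumed here, is to cap values at the boxplot whiskers (Q1 - 1.5·IQR and Q3 + 1.5·IQR). The helper name `cap_outliers_iqr` is hypothetical and the weights are made up.

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy weights in grams with one extreme value (made-up data).
weights = pd.Series([140.0, 150.0, 155.0, 160.0, 170.0, 900.0])
capped = cap_outliers_iqr(weights)
print(capped)
```

Capping keeps every row, which matters here because the row count stays the same before and after treatment.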

Observations

  1. Now, the outliers are all treated

EDA

Observation on screen_size

Observation on main_camera_mp

Observations on selfie_camera_mp

Observations on RAM

Observations on battery

Observations on weight

Observations on release_year

Observations on days_used

Observations on new_price

Observations on used_price

Let's explore Categorical Variables

Observations on 4g

Observations on 5g

Observations on brand_type

Bivariate Analysis

Bivariate Scatter Plots

Relationship between used_price and days_used

Relationship between used_price and new_price

Relationship between used_price and ram

Relationship between used_price and int_memory

Relationship between used_price and weight

Relationship between used_price and battery

Relationship between used_price and main_camera_mp

Relationship between used_price and selfie_camera_mp

Relationship between used_price and screen_size

Check brand_type on os

Observations

  1. Android phones are mostly low brand_type
  2. iOS phones are high brand_type

Check os on 5g

Observations

Very few Android based phones have 5g. None of the Others, Windows and iOS based phones have 5g.

Check brand_type on 5g

Observations

There are many low brand_type phones that do not have 5g. High end phones do not have 5g.

Check 4g on 5g

Observations

Very few phones have both 4g and 5g

Used_price vs os

Observations

iOS based used_phones have high used_price. Others os based and Windows based used_phones have low used_price.

Used_price vs 4g

The 4g-enabled used_phones have a higher used_price

Used_price vs 5g

Observations

The 5g-enabled used_phones have a higher used_price.

Correlation used_price, brand_type and os

Observations

  1. Among the high brand_type phones, Android ones have more used_price.
  2. Among the medium brand_type phones, all types of os based phones are nearly equally priced
  3. Among the low brand_type phones, those with an Others os are priced lower

Correlation used_price, brand_type and 4g

Observation

  1. All the high brand_type phones have 5g and are highly priced.
  2. The low brand_type phones that have no 4g are priced lower.

Correlation used_price, brand_type and 5g

Observations

  1. Many low brand_type phones have 5g.
  2. high brand type phones with and without 5g are equally priced

Questions:

  1. What does the distribution of used phone prices look like?
  2. What percentage of the used phone market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?
  4. A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?
  6. Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the used phone price?

What does the distribution of used phone prices look like?

Observations

Maximum used_price is ~250 Euros. There are many phones whose used price is below 100 Euros. The distribution is right skewed.

What percentage of the used phone market is dominated by Android devices?

Observation

About 90.9% of the used_phones are Android devices

The amount of RAM is important for the smooth functioning of a phone. How does the amount of RAM vary with the brand?

Observation

RAM seems to be discrete and constant in nature. All the phones have RAM between 3.5GB and 4.5GB. The Others brand has the largest number of used_phones.

A large battery often increases a phone's weight, making it feel uncomfortable in the hands. How does the weight vary for phones offering large batteries (more than 4500 mAh)?

Observation

Phones with large batteries tend to be heavier than phones with small batteries.

Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones are available across different brands with a screen size larger than 6 inches?

Observation

Now the number of phones with screens larger than 6 inches has increased. There are 3571 used_phones that have screen_size greater than 6 inches.

Budget phones nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget phones offering greater than 8MP selfie cameras across brands?

Observation

  1. The graph is left skewed.
  2. Now the range has reduced to 9-17 megapixels
  3. There are many used_phones with 16 megapixels, followed by 17 megapixels

Which attributes are highly correlated with the used phone price?

Observations

  1. new_price is highly correlated with used_price with 0.91 correlation
  2. int_memory and used_price have 0.59 correlation. selfie_camera_mp and used_price have 0.64 correlation

Building a Linear Regression model

Linear Model Building

  1. Our aim is to predict the price of the used phone.
  2. Before proceeding we will have to encode the categorical features.
  3. We split the data into train and test sets, so that the model is built on the train data and evaluated on the test data.
  4. Build a linear regression model and check its performance
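The four steps above can be sketched as follows on synthetic data. The 70/30 split is an assumption that happens to match the 2499/1072 row counts reported below. The `random_state`, the generated data and the choice of `drop_first=True` are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (not the real dataset).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "new_price": rng.uniform(50, 500, n),
    "days_used": rng.integers(100, 1000, n),
    "os": rng.choice(["Android", "iOS", "Others"], n),
})
df["used_price"] = (
    0.4 * df["new_price"] - 0.05 * df["days_used"] + rng.normal(0, 5, n)
)

# Step 2: one-hot encode categorical features, dropping one level per feature.
X = pd.get_dummies(
    df.drop(columns="used_price"), columns=["os"], drop_first=True, dtype=float
)
y = df["used_price"]

# Step 3: hold out 30% of the rows for evaluation.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Step 4: fit on the train set and score (R-squared) on the test set.
model = LinearRegression().fit(x_train, y_train)
r2_test = model.score(x_test, y_test)
print(x_train.shape, x_test.shape, round(r2_test, 3))
```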

Observations

There are 2499 rows in train data and 1072 rows in test data

Checking the coefficients and intercepts of the model

Model performance evaluation

Observations

  1. The training R2 is 95.5%, indicating the model explains 95.5% of the variation in the train data. So, the model is not underfitting.
  2. MAE and RMSE on the train and test sets are comparable, indicating the model is not overfitting.
  3. MAE indicates that the current model is able to predict used_price within a mean error of 10.16 Euros on the test data.
  4. MAPE on the test data suggests we can predict within 16.6% of the used_price.
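For reference, the three error metrics mentioned above can be computed as below. The helper name `regression_metrics` and the sample numbers are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        # MAPE divides by the actual value, so it assumes no zero prices.
        "MAPE": np.mean(np.abs(err) / np.abs(y_true)) * 100,
    }

metrics = regression_metrics([100, 200, 50], [110, 180, 55])
print(metrics)
```

RMSE is never smaller than MAE, and a large gap between the two points to a few large errors.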

Linear Regression using statsmodels

Observations

  1. Negative coefficients show that used_price decreases as the corresponding attribute value increases. Positive coefficients show that used_price increases as the corresponding attribute value increases.
  2. The p-value of a variable indicates whether the variable is significant or not. Let's take the significance level to be 5%; then any variable with a p-value less than 5% is considered significant. However, these variables might contain multicollinearity, which might affect the p-values.
  3. Let's deal with the multicollinearity and verify the other assumptions of linear regression, followed by the p-values.

Checking Linear Regression Assumptions

Checking for the below assumptions:

  1. No Multicollinearity

  2. Linearity of variables

  3. Independence of error terms

  4. Normality of error terms

  5. No Heteroscedasticity

1. Test for Multicollinearity

Observations

  1. The VIFs of the attributes are mostly between 1 and 5.
  2. Looking at the VIFs, there seems to be low multicollinearity between the attributes.
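As a reminder of what the VIF measures: for each predictor, regress it on the other predictors and compute 1 / (1 - R²). The sketch below does this with NumPy on synthetic data. The helper `vif_table` is hypothetical and, unlike some library routines, adds an intercept to each auxiliary regression.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    vifs = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Synthetic predictors: battery is built to track weight closely.
rng = np.random.default_rng(0)
n = 500
weight = rng.normal(size=n)
X = pd.DataFrame({
    "weight": weight,
    "days_used": rng.normal(size=n),
    "battery": weight + 0.1 * rng.normal(size=n),
})
vif = vif_table(X)
print(vif)
```

In this toy example the two collinear columns get large VIFs while the independent one stays near 1.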

However, some attributes have a p-value > 0.05. Hence, we loop over the model fit, dropping the columns with p-value > 0.05.

Now no feature has a p-value greater than 0.05. So, let's consider x_train2 as the final data and olsmod2 as the final model.

Observations

  1. Now the adjusted R-squared is 0.985, i.e., the model is able to explain 98.5% of the variance.
  2. The adjusted R-squared of olsmod0 was 0.955. This shows that dropping the variables changed the adjusted R-squared by 3 percentage points.

Test for Linearity and Independence

Observation

  1. The above plot shows the distribution of residuals vs fitted values.
  2. Since we do not see any pattern, the assumptions of linearity and independence are satisfied.

Test for Normality

Observations

The histogram of residuals has a bell shape.

Observation

The residuals follow a straight line except at the tails. Let's check with the Shapiro-Wilk test.

Observation

  1. p-value < 0.05, so the residuals are not normal as per the Shapiro-Wilk test.
  2. Strictly speaking, the residuals are not normal.
  3. However, as an approximation, we accept the distribution as normal.
  4. Hence, the assumption is satisfied.
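For reference, the Shapiro-Wilk test is available as `scipy.stats.shapiro`. Its null hypothesis is that the sample is normally distributed, so a small p-value rejects normality. The sketch below uses synthetic samples in place of the model residuals.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for residuals: one normal sample, one clearly skewed.
rng = np.random.default_rng(42)
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500)

# shapiro returns the W statistic (at most 1) and a p-value.
stat_n, p_n = stats.shapiro(normal_resid)
stat_s, p_s = stats.shapiro(skewed_resid)
print(f"normal sample: W={stat_n:.3f}, p={p_n:.3f}")
print(f"skewed sample: W={stat_s:.3f}, p={p_s:.3g}")
```

With large samples the test flags even mild departures from normality, which is one reason to read it alongside the histogram and the residual plot.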

Test for Homoscedasticity

Observation

  1. p-value > 0.05. Hence, the residuals are homoscedastic and the assumption is satisfied.

Now, moving on to the prediction part.

Observation

  1. We observe that the actual and the predicted values are comparable. Let's visualize the result as a bar graph.

Observations

  1. The model is able to explain 95% of the variation in the data
  2. The MAPE on the test set suggests we can predict within 16.58% of the used_price.
  3. Hence, we can conclude that olsmod2 is a good model.

Let's compare the initial sklearn model and the final statsmodels model

Observations

The performance of both the models seems to be close to each other

Observation

Both the models seem to be similar

Final Model Summary

Actionable Insights and Recommendations

  1. The number of days used (days_used) came out to be significant, as expected. As the number of days the phone was used increases, the used_price decreases.
  2. Factors like main_camera_mp, os_Others and 4g_yes decrease the used_price.
  3. selfie_camera_mp, int_memory, release_year, new_price and os_iOS have positive coefficients

The Linear regression model is able to explain 98.5% of the variance with a confidence level of 95%.

Based on the model we can say that

  1. Selfie camera resolution, internal memory, release year, new price and an iOS operating system have a positive effect on used price. Every unit increase in selfie camera resolution results in a 0.67 unit increase in used price. A unit increase in new price results in a 0.4 unit increase in used price. An iOS device has an 8.6 unit higher used price. Every unit increase in internal memory results in a 0.1 unit increase in used price. A unit increase in release year results in a 0.03 unit increase in used price.

  2. Main camera resolution, days used, an operating system other than Android, iOS or Windows, 4g capability and a brand type of medium or high have a negative effect on used price. Every unit increase in main camera resolution results in a 0.33 unit decrease in used price. Every unit increase in days used results in a 0.08 unit decrease in used price. A non-iOS/Android/Windows device has a 5 unit lower used price. A 4g-enabled device has a 3.66 unit lower price. Brands of type medium (i.e., with new price between 500 and 1500) have a 14 unit lower used price. Brands of type high (i.e., with new price between 1500 and 3000) have a 26 unit lower used price.
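To make the arithmetic concrete, the sketch below stores the coefficients quoted above and computes the change in predicted used price for a given change in features. The intercept is not stated, so only differences are computed. The dictionary keys `brand_type_medium` and `brand_type_high` and the helper `price_change` are hypothetical names.

```python
# Coefficients as quoted in the text (change in used price per unit of feature).
coefficients = {
    "selfie_camera_mp": 0.67,
    "new_price": 0.4,
    "os_iOS": 8.6,
    "int_memory": 0.1,
    "release_year": 0.03,
    "main_camera_mp": -0.33,
    "days_used": -0.08,
    "os_Others": -5.0,
    "4g_yes": -3.66,
    "brand_type_medium": -14.0,  # hypothetical column name
    "brand_type_high": -26.0,    # hypothetical column name
}

def price_change(feature_changes):
    """Change in predicted used price, holding all other features fixed."""
    return sum(coefficients[name] * delta for name, delta in feature_changes.items())

# 100 extra days of use lowers the predicted price by about 8 units.
extra_use = price_change({"days_used": 100})

# An 8 MP better selfie camera offsets part of that drop.
with_better_selfie = price_change({"selfie_camera_mp": 8, "days_used": 100})

print(round(extra_use, 2), round(with_better_selfie, 2))
```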

Hence to get most revenue, concentrate on

  1. phones that have better selfie camera resolution
  2. phones that have higher new price.
  3. iOS phones
  4. Phones with higher internal memory.
  5. More recent phones

Following categories may not fetch much revenues:

  1. Non iOS/Android/Windows phones
  2. Phones with only high Main camera resolution.
  3. Phones with new prices between 500 and 3000

Additionally, given that there are very few iOS phones, higher used prices can be set for such phones, but they may cater only to a niche segment. Even though there are few non-iOS/Android/Windows phones, their predicted prices are also low. Hence, do not invest in sales of these models. Most of the used_phones fall below the 250 Euros price range, so this will be the sector to concentrate on for mass sales. Higher prices can be set for phones with higher selfie camera resolution, which seems to be the latest trend. Some incentives / discounts can be given on phones with only a higher main camera resolution, as prices are not inclined towards such phones.